Abstract: We consider a multi-channel opportunistic
communication system where the states of these channels evolve
as independent and statistically identical Markov chains (the 
Gilbert-Elliot channel model). A user chooses one channel to 
sense and access in each slot and collects a reward determined by 
the state of the chosen channel. The problem is to design a sensing 
policy for channel selection to maximize the average reward, 
which can be formulated as a multi-arm restless bandit process. 
In this paper, we study the structure, optimality, and performance 
of the myopic sensing policy. We show that the myopic sensing 
policy has a simple robust structure that reduces channel selection 
to a round-robin procedure and obviates the need for knowing 
the channel transition probabilities. The optimality of this simple 
policy is established for the two-channel case and conjectured for 
the general case based on numerical results. The performance of 
the myopic sensing policy is analyzed, which, based on the opti- 
mality of myopic sensing, characterizes the maximum throughput 
of a multi-channel opportunistic communication system and 
its scaling behavior with respect to the number of channels. 
These results apply to cognitive radio networks, opportunistic 
transmission in fading environments, downlink scheduling in 
centralized networks, and resource-constrained jamming and 
anti-jamming. 

Index Terms: Opportunistic access, cognitive radio, multi-channel 
MAC, multi-arm restless bandit process, myopic policy. 

I. Introduction 

A. Multi-Channel Opportunistic Access 

The fundamental idea of opportunistic access is to adapt the 
transmission parameters (such as data rate and transmission 
power) according to the state of the communication environ- 
ment including, for example, fading conditions, interference 
level, and buffer state. Since the seminal work by Knopp 
and Humblet in 1995 [1], the concept of opportunistic access 
has found applications beyond transmission and scheduling 
over fading channels. An emerging application is cognitive 
radio for opportunistic spectrum access, where secondary users 
search in the spectrum for idle channels temporarily unused by 
primary users [2]. Another application is resource-constrained 
jamming and anti-jamming, where a jammer seeks channels 
occupied by users or a user tries to avoid jammers. 

We consider a general opportunistic communication system 
where a user has access to N parallel channels and chooses 
one channel to sense and access in each slot, aiming to
maximize its expected long-term reward (i.e., throughput). This user
can be a base station, and each channel is associated with a 
downlink receiver. In this case, channel selection is equivalent 
to receiver selection, and the general problem considered here 
also applies to downlink scheduling in a centralized network. 

These N channels are modelled as independent and stochastically
identical Gilbert-Elliot channels [3], a model that has been
commonly used to abstract physical channels with memory
(see, for example, [4], [5]). As illustrated in Fig. 1, the state
of a channel (good or bad) indicates the desirability of
accessing this channel and determines the resulting reward.
For example, in cognitive radio networks,
the good state represents a channel unused by primary users
while the bad state represents an occupied channel (see footnote 1). The transitions
between these two states follow a Markov chain with transition
probabilities $\{p_{ij}\}_{i,j=0,1}$.

Fig. 1. The Gilbert-Elliot channel model.

A sensing policy that governs the channel selection in each 
slot is crucial to the efficiency of multi-channel opportunistic 
access. The design of the optimal sensing policy can be 
formulated as a partially observable Markov decision process 
(POMDP) for generally correlated channels, or a restless 
multi-armed bandit process for independent channels. Unfor- 
tunately, obtaining the optimal policy for a general POMDP 
or restless bandit process is often intractable due to the 
exponential computation complexity. 

A common approach of trading performance for tractable 
solutions is to consider myopic policies. A myopic policy 
aims solely at maximizing the immediate reward, ignoring the 
impact of the current action on the future reward. Obtaining 
a myopic policy is thus a static optimization problem instead 
of a sequential decision-making problem. As a consequence, 
the complexity is significantly reduced, often at the price of 
considerable performance loss. 

In this paper, we show that for designing sensing strategies 
for multi-channel opportunistic access, low complexity does 
not necessarily imply suboptimal performance. The myopic
sensing policy with a simple and robust structure achieves
the optimal performance under the i.i.d. Gilbert-Elliot channel
model.

1 When the primary network employs load balancing across channels, the
occupancy processes of all channels can be considered stochastically identical.

B. Contribution 

Under the i.i.d. Gilbert-Elliot channel model, we establish 
the structure and optimality of the myopic sensing policy and 
analyze its performance. 

1) Structure of Myopic Sensing: The first contribution of 
this paper is the establishment of a simple and robust structure 
of the myopic sensing policy. Besides significant implications 
in the practical implementation, this result serves as the key 
to the optimality proof and the performance analysis. 

We show that the basic structure of the myopic policy is
a round-robin scheme based on a circular ordering of the
channels. For the case of $p_{11} > p_{01}$, the circular order is
constant and determined by the initial information (if any) on
the state of each channel. The myopic action is to stay in the
same channel when it is good (state 1) and switch to the next
channel in the circular order when it is bad. In the case of
$p_{11} < p_{01}$, the circular order is reversed in every slot with the
initial order determined by the initial information on channel
states. The myopic policy stays in the same channel when it is
bad; otherwise, it switches to the next channel in the current
circular order (see footnote 2).

The significance of this result in terms of the practical 
implementations of myopic sensing is twofold. First, it demon- 
strates the simplicity of myopic sensing: channel selection is 
reduced to a simple round-robin procedure. The myopic sens- 
ing policy requires no computation and little memory. Second, 
it shows that myopic sensing is robust to model mismatch. 
Specifically, the myopic sensing policy has a semi-universal 
structure; it can be implemented without knowing the channel 
transition probabilities. The only required information about 
the channel model is the order of $p_{11}$ and $p_{01}$. As a result, the
myopic sensing policy automatically tracks variations in the
channel model provided that the order of $p_{11}$ and $p_{01}$ remains
unchanged. Note that when $p_{11} = p_{01}$, channel states become
independent in time; all channel selections lead to the same
performance. We thus expect that myopic sensing is robust to
estimation errors in the order of $p_{11}$ and $p_{01}$, which usually
occur when $p_{11} \approx p_{01}$. This has been confirmed by simulation
results [6]. 

2) Optimality of Myopic Sensing: Surprisingly, the myopic 
sensing policy with such a simple and robust structure is, in 
fact, optimal as established in this paper for $N = 2$. Based
on numerical results, we conjecture that the optimality of the 
myopic policy can be generalized to N > 2. The optimality 
along with the simple and robust structure makes the myopic 
sensing policy particularly appealing. 

In a recent work [8], based on the structure of the myopic
policy, the optimality result has been extended to $N > 2$ under
the condition of $p_{11} > p_{01}$. While numerical results indicate
that for a wide range of $p_{11}$ and $p_{01}$, the myopic policy is
also optimal for $N > 2$ with $p_{11} < p_{01}$, pathological cases
where optimality fails have been found when $p_{01} - p_{11}$ is
close to 1. Nevertheless, the performance loss of the myopic
policy in these cases is minimal and tends to diminish with
the horizon length. Establishing necessary and/or sufficient
conditions (potentially in the form of bounding $p_{01} - p_{11}$)
under which the myopic policy is optimal for $p_{11} < p_{01}$
appears to be challenging. It is our hope that results and
approaches presented in this paper, in particular, the simple
structure of the myopic policy, may stimulate fresh ideas for
completing the picture on the optimality of the myopic policy.

2 It is easy to show that $p_{11} > p_{01}$ corresponds to the case where the
channel states in two consecutive slots are positively correlated, i.e., for any
distribution of $S(t)$, we have $E[(S(t) - E[S(t)])(S(t+1) - E[S(t+1)])] > 0$,
where $S(t)$ is the state of the Gilbert-Elliot channel in slot $t$. Similarly,
$p_{11} < p_{01}$ corresponds to the case where $S(t)$ and $S(t+1)$ are negatively
correlated, and $p_{11} = p_{01}$ to the case where $S(t)$ and $S(t+1)$ are independent.
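The correlation claim in footnote 2 can be checked in one step. The following is a short worked derivation, where $\omega = \Pr[S(t) = 1]$ is a symbol introduced here only for this check and is assumed to lie strictly between 0 and 1:

$$E[S(t)S(t+1)] = \omega\, p_{11}, \qquad E[S(t)] = \omega, \qquad E[S(t+1)] = \omega\, p_{11} + (1-\omega)\, p_{01},$$

$$E[(S(t) - E[S(t)])(S(t+1) - E[S(t+1)])] = \omega\, p_{11} - \omega\bigl(\omega\, p_{11} + (1-\omega)\, p_{01}\bigr) = \omega(1-\omega)(p_{11} - p_{01}).$$

The covariance therefore has the sign of $p_{11} - p_{01}$: positive for $p_{11} > p_{01}$, negative for $p_{11} < p_{01}$, and zero for $p_{11} = p_{01}$.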

3) Performance of Myopic Sensing: The optimality of the 
myopic sensing policy motivates the performance analysis, as 
its performance defines the throughput limit of a multi-channel 
opportunistic communication system under the i.i.d. Gilbert- 
Elliot channel model. We are particularly interested in the 
relationship between the maximum throughput and the number 
of channels. 

Closed-form expressions for the performance of POMDP 
and restless bandit policies are rare. For the problem at hand,
the simple structure of the myopic policy again renders an 
exception. Specifically, based on the structure of the my- 
opic policy, we show that its performance is determined by 
the stationary distributions of a higher-order countable-state 
Markov chain. For N = 2, we have a first-order Markov 
chain whose stationary distribution can be obtained in closed- 
form, leading to exact characterizations of the throughput. 
For N > 2, we construct first-order Markov processes that 
stochastically dominate or are dominated by this higher-order 
Markov chain. The stationary distributions of the former, again 
obtained in closed form, lead to lower and upper bounds that
monotonically tighten as the number N of channels increases. 

These analytical characterizations allow us to study the rate 
at which the maximum throughput of an opportunistic system 
increases with N, and to obtain the limiting performance as
N approaches infinity. Our result demonstrates that the
maximum throughput of a multi-channel opportunistic system
with single-channel sensing saturates at a geometric rate as the
number of channels increases. This result suggests to system
designers the importance of having radios capable of sensing 
multiple channels in order to fully exploit the communication 
opportunities offered by a large number of channels. 

C. Related Work 

The structure, optimality, and performance analysis of my- 
opic sensing in the context of opportunistic access may bear 
significance in the general context of restless multi-armed 
bandit processes. While an index policy (Gittins index [11]) 
is known to be optimal for the classical bandit problems, the 
structure of the optimal policy for a general restless bandit 
process remains unknown, and the problem is shown to be 
PSPACE-hard [12]. Whittle proposed a Gittins-like heuristic 
index policy for restless bandit problems [7], which is asymptotically
optimal in a certain limiting regime [13]. Beyond this
asymptotic result, relatively little is known about the structure
of the optimal policies for a general restless bandit process.
The existing literature mainly focuses on approximation algorithms
and heuristic policies [9], [10]. The optimality of the
myopic policy shown in this paper suggests non-asymptotic
conditions under which an index policy can be optimal for
restless bandit processes.

The results presented in this paper apply to cognitive radio
networks, which have received increasing attention recently. In
this context, the design of sensing policies for tracking the
rapidly varying spectrum opportunities has been addressed
in [14], [15] under a general Markovian model of potentially
correlated channels, where a POMDP framework has been
developed.

This paper is also related to channel probing and transmission
strategies in multichannel wireless systems (see [16]-
[19] and references therein). In contrast to the Markovian
model considered in this paper, these existing results adopt
a memoryless channel model.

II. Problem Formulation

We consider the scenario where a user is trying to access
N independent and stochastically identical channels using a
slotted transmission structure. The state $S_i(t)$ of channel $i$ in
slot $t$ is given by a two-state Markov chain shown in Fig. 1.
At the beginning of each slot, the user selects one of the N
channels to sense. If the channel is sensed to be good (state 1),
the user transmits and collects one unit of reward. Otherwise,
the user does not transmit (or transmits at a lower rate), collects
no reward, and waits until the next slot to make another choice.
The objective is to maximize the average reward (throughput)
over a horizon of T slots by judiciously choosing a sensing
policy that governs channel selection in each slot.

Due to limited sensing, the full system state
$[S_1(t), \cdots, S_N(t)] \in \{0, 1\}^N$ in slot $t$ is not observable.
The user, however, can infer the state from its decision
and observation history. It has been shown that a sufficient
statistic for optimal decision making is given by the
conditional probability that each channel is in state 1
given all past decisions and observations [20]. Referred to
as the belief vector, this sufficient statistic is denoted by
$\Omega(t) = [\omega_1(t), \cdots, \omega_N(t)]$, where $\omega_i(t)$ is the conditional
probability that $S_i(t) = 1$. Given the sensing action $a(t)$
and the observation $S_{a(t)}(t)$ in slot $t$, the belief vector for
slot $t + 1$ can be obtained via Bayes' rule as given in (1):

$$\omega_i(t+1) = \begin{cases} p_{11}, & a(t) = i,\ S_{a(t)}(t) = 1 \\ p_{01}, & a(t) = i,\ S_{a(t)}(t) = 0 \\ \omega_i(t)\, p_{11} + (1 - \omega_i(t))\, p_{01}, & a(t) \neq i \end{cases} \tag{1}$$
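As a minimal sketch of the belief update in (1), the following Python functions propagate a belief vector by one slot. The names `update_belief` and `stationary_belief`, and the 0-based channel indexing, are illustrative assumptions and not notation from the paper. The unobserved-channel branch is the one-step Markov propagation $\omega p_{11} + (1-\omega) p_{01}$.

```python
from typing import List


def stationary_belief(p11: float, p01: float) -> float:
    """Stationary probability of the good state, p01 / (p01 + p10)."""
    p10 = 1.0 - p11
    return p01 / (p01 + p10)


def update_belief(beliefs: List[float], action: int, observation: int,
                  p11: float, p01: float) -> List[float]:
    """One-slot belief update.

    beliefs[i] is the probability that channel i is good in the current slot,
    `action` is the sensed channel (0-based), `observation` is its state (0 or 1).
    """
    updated = []
    for i, w in enumerate(beliefs):
        if i == action:
            # The sensed channel's state is known, so its belief resets.
            updated.append(p11 if observation == 1 else p01)
        else:
            # Unobserved channels evolve by one Markov transition.
            updated.append(w * p11 + (1.0 - w) * p01)
    return updated
```

A belief equal to the stationary value is a fixed point of the unobserved-channel branch, which is why it is a natural initial belief when nothing is known about the initial state.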

A sensing policy $\pi$ specifies a sequence of functions $\pi =
[\pi_1, \pi_2, \cdots, \pi_T]$, where $\pi_t$ is the decision rule at time $t$
that maps a belief vector $\Omega(t)$ to a sensing action $a(t) \in
\{1, \cdots, N\}$ for slot $t$. Multi-channel opportunistic access can
thus be formulated as the following stochastic control problem.

$$\pi^* = \arg\max_{\pi} E_{\pi}\left[\sum_{t=1}^{T} R_{\pi_t(\Omega(t))}(t) \,\Big|\, \Omega(1)\right], \tag{2}$$

where $\pi_t(\Omega(t))$ is the channel selected and $R_{\pi_t(\Omega(t))}(t) =
S_{\pi_t(\Omega(t))}(t)$ the reward so obtained when the belief is $\Omega(t)$,
and $\Omega(1)$ is the initial belief vector. If no information about
the initial system state is available, each entry of $\Omega(1)$ can be
set to the stationary distribution $\omega_o$ of the underlying Markov
chain:

$$\omega_o = \frac{p_{01}}{p_{01} + p_{10}}. \tag{3}$$

This problem falls into the general model of POMDP. It can
also be considered as a restless multi-armed bandit problem
by treating the belief value of each channel as the state of
each arm of a bandit. Note that for a given sensing policy
$\pi$, the belief vectors $\{\Omega(t)\}_{t=1}^{T}$ form a Markov process with
an uncountable state space. The expectation in (2) is with
respect to this Markov process which determines the reward
process. The difficulty in obtaining the optimal policy $\pi^*$
and characterizing its performance largely results from the
complexity of analyzing a Markov process with uncountable
state space.

III. Optimal Policy vs. Myopic Policy

A. Value Function and Optimal Policy

Let $V_t(\Omega(t))$ be the value function, which represents the
maximum expected total reward that can be obtained starting
from slot $t$ given the current belief vector $\Omega(t)$. Given that
the user takes action $a$ and observes $S_a(t)$ in slot $t$, the
reward that can be accumulated starting from slot $t$ consists
of two parts: the expected immediate reward $E[R_a(t)] =
E[S_a(t)] = \omega_a(t)$ and the maximum expected future reward
$V_{t+1}(\mathcal{T}(\Omega(t)|a, S_a(t)))$, where $\mathcal{T}(\Omega(t)|a, S_a(t))$ denotes the
updated belief vector for slot $t + 1$ as given in (1). Averaging
over all possible observations $S_a(t)$ and maximizing over all
actions $a$, we arrive at the following optimality equations.

$$V_T(\Omega(T)) = \max_{a=1,\cdots,N} \omega_a(T),$$

$$V_t(\Omega(t)) = \max_{a=1,\cdots,N} \Bigl\{ \omega_a(t) + \omega_a(t)\, V_{t+1}(\mathcal{T}(\Omega(t)|a, 1)) + (1 - \omega_a(t))\, V_{t+1}(\mathcal{T}(\Omega(t)|a, 0)) \Bigr\}. \tag{4}$$

In theory, the optimal policy $\pi^*$ and its performance
$V_1(\Omega(1))$ can be obtained by solving the above dynamic
program. Unfortunately, this approach is computationally prohibitive
due to the impact of the current action on the future
reward and the uncountable space of the belief vector $\Omega(t)$.
Even if approximate numerical solutions are feasible, they do
not provide insights for system design or analytical characterizations
of the optimal performance $V_1(\Omega(1))$.

B. Myopic Policy 

A myopic policy ignores the impact of the current action 
on the future reward, focusing solely on maximizing the 
expected immediate reward $E[R_a(t)]$. Myopic policies are thus
stationary: the mapping from belief vectors to actions does not
change with time $t$. The myopic action $\hat{a}(t)$ and the value
function $\hat{V}_t(\Omega(t))$ of the myopic policy for a given belief
vector $\Omega(t)$ are given by

$$\hat{a}(t) = \arg\max_{a=1,\cdots,N} \omega_a(t), \tag{5}$$

$$\hat{V}_t(\Omega(t)) = \omega_{\hat{a}(t)}(t) + \omega_{\hat{a}(t)}(t)\, \hat{V}_{t+1}(\mathcal{T}(\Omega(t)|\hat{a}(t), 1)) + (1 - \omega_{\hat{a}(t)}(t))\, \hat{V}_{t+1}(\mathcal{T}(\Omega(t)|\hat{a}(t), 0)).$$

In general, obtaining the myopic action in each slot requires
the recursive update of the belief vector $\Omega(t)$ as given in (1),
which requires the knowledge of the transition probabilities
$\{p_{ij}\}$. In the next section, we show that the myopic policy
has a simple semi-universal structure that does not need the
update of the belief vector or the knowledge of the transition
probabilities.
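The two recursions above can be compared directly on small instances. The following is a brute-force sketch, assuming the finite-horizon recursion in (4) for the optimal value and the analogous recursion for the myopic value; the function names are hypothetical, channels are 0-based, and the exhaustive recursion is only practical for small N and short horizons. For two channels the two values are expected to coincide, in line with the optimality result established later in the paper for $N = 2$.

```python
from typing import List, Sequence


def _update(beliefs: Sequence[float], a: int, obs: int,
            p11: float, p01: float) -> List[float]:
    """Belief update: reset the sensed channel, propagate the others."""
    return [(p11 if obs == 1 else p01) if i == a else w * p11 + (1.0 - w) * p01
            for i, w in enumerate(beliefs)]


def optimal_value(beliefs: Sequence[float], t: int, horizon: int,
                  p11: float, p01: float) -> float:
    """Maximum expected total reward from slot t to slot `horizon`."""
    best = 0.0
    for a, w in enumerate(beliefs):
        value = w  # expected immediate reward of sensing channel a
        if t < horizon:
            value += w * optimal_value(
                _update(beliefs, a, 1, p11, p01), t + 1, horizon, p11, p01)
            value += (1.0 - w) * optimal_value(
                _update(beliefs, a, 0, p11, p01), t + 1, horizon, p11, p01)
        best = max(best, value)
    return best


def myopic_value(beliefs: Sequence[float], t: int, horizon: int,
                 p11: float, p01: float) -> float:
    """Expected total reward of always sensing the channel with the largest belief."""
    a = max(range(len(beliefs)), key=lambda i: beliefs[i])
    w = beliefs[a]
    if t == horizon:
        return w
    return (w
            + w * myopic_value(
                _update(beliefs, a, 1, p11, p01), t + 1, horizon, p11, p01)
            + (1.0 - w) * myopic_value(
                _update(beliefs, a, 0, p11, p01), t + 1, horizon, p11, p01))
```

The number of recursive calls grows as $(2N)^{T}$, which illustrates why solving the dynamic program directly does not scale.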

IV. Structure of Myopic Sensing 

In this section, we establish the simple and robust structure 
of the myopic policy, which lays out the foundation for 
the optimality proof and performance analysis in subsequent 
sections. 

A. Structure 

The basic element in the structure of the myopic
policy is a circular ordering $\mathcal{K}$ of the channels. For
a circular order, the starting point is irrelevant: a circular
order $\mathcal{K} = (n_1, n_2, \cdots, n_N)$ is equivalent to
$(n_i, n_{i+1}, \cdots, n_N, n_1, n_2, \cdots, n_{i-1})$ for any $1 \le i \le N$. An
example of a circular order is given in Fig. 2, where all N
channels are placed on a circle in the clockwise direction.

We now introduce the following notations. For a circular
order $\mathcal{K}$, let $-\mathcal{K}$ denote its reverse circular order, i.e., for $\mathcal{K} =
(n_1, n_2, \cdots, n_N)$, we have $-\mathcal{K} = (n_N, n_{N-1}, \cdots, n_1)$ (see
Fig. 3 for an illustration where the lower circle on the right
shows the reverse circular order of that given by the circle on
the left).

For a channel $i$, let $i^+_{\mathcal{K}}$ denote the next channel in the circular
order $\mathcal{K}$. For example, for $\mathcal{K} = (1, 2, \cdots, N)$, we have $i^+_{\mathcal{K}} =
i + 1$ for $1 \le i < N$ and $N^+_{\mathcal{K}} = 1$.

With these notations, we present the structure of the myopic
policy in Theorem 1.

Theorem 1: Structure of Myopic Sensing:
Let $\Omega(1) = [\omega_1(1), \cdots, \omega_N(1)]$ denote the initial belief vector.
The circular channel order $\mathcal{K}(1)$ in slot 1 is determined by
a descending order of $\Omega(1)$ (i.e., $\mathcal{K}(1) = (n_1, n_2, \cdots, n_N)$
implies that $\omega_{n_1}(1) \ge \omega_{n_2}(1) \ge \cdots \ge \omega_{n_N}(1)$). Let $\hat{a}(1) =
\arg\max_{i=1,\cdots,N} \omega_i(1)$. The myopic action $\hat{a}(t)$ in slot $t$ ($t >
1$) is given as follows.

• Case 1: $p_{11} > p_{01}$

$$\hat{a}(t) = \begin{cases} \hat{a}(t-1), & \text{if } S_{\hat{a}(t-1)}(t-1) = 1 \\ \hat{a}(t-1)^+_{\mathcal{K}(t)}, & \text{if } S_{\hat{a}(t-1)}(t-1) = 0 \end{cases} \tag{6}$$

where $\mathcal{K}(t) = \mathcal{K}(1)$.

• Case 2: $p_{11} < p_{01}$

$$\hat{a}(t) = \begin{cases} \hat{a}(t-1), & \text{if } S_{\hat{a}(t-1)}(t-1) = 0 \\ \hat{a}(t-1)^+_{\mathcal{K}(t)}, & \text{if } S_{\hat{a}(t-1)}(t-1) = 1 \end{cases} \tag{7}$$

where $\mathcal{K}(t) = \mathcal{K}(1)$ when $t$ is odd and $\mathcal{K}(t) = -\mathcal{K}(1)$ when
$t$ is even.

Proof: See Appendix A. ■

Theorem 1 shows that the basic structure of the myopic
policy is a round-robin scheme based on a circular ordering
of the channels. For $p_{11} > p_{01}$, the circular order is constant:
$\mathcal{K}(t) = \mathcal{K}(1)$ in every slot $t$, where $\mathcal{K}(1)$ is determined by
a descending order of the initial belief values. The myopic
action is to stay in the same channel when it is good (state 1)
and switch to the next channel in the circular order when it is
bad (see Fig. 2 for an illustration).

Fig. 2. The structure of the myopic policy for $p_{11} > p_{01}$: the circular
order of the channels is constant and determined by the initial belief $\Omega(1)$
($\omega_1(1) \ge \omega_2(1) \ge \cdots \ge \omega_N(1)$ is assumed in this example, thus $\hat{a}(1) =
1$); the myopic policy switches to the next channel when the current one is
in the bad state.

In the case of $p_{11} < p_{01}$, the circular order is reversed in
every slot, i.e., $\mathcal{K}(t) = \mathcal{K}(1)$ when $t$ is odd and $\mathcal{K}(t) = -\mathcal{K}(1)$
when $t$ is even, where the initial order $\mathcal{K}(1)$ is determined
by the initial belief values. The myopic policy stays in the
same channel when it is bad; otherwise, it switches to the
next channel in the current circular order $\mathcal{K}(t)$, which is either
$\mathcal{K}(1)$ or $-\mathcal{K}(1)$ depending on whether the current time $t$ is
odd or even. An illustration is given in Fig. 3.

An alternative way to see the channel switching structure 
of the myopic policy is through the last visit to each channel 
(once every channel has been visited at least once). Specifically,
for $p_{11} > p_{01}$, when a channel switch is needed, the
policy selects the channel visited the longest time ago. For
$p_{11} < p_{01}$, when a channel switch is needed, the policy selects,
among those channels to which the last visit occurred an even 
number of slots ago, the one most recently visited. If there 
are no such channels, the user chooses the channel visited the 
longest time ago (see Appendix B for a proof). 
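The round-robin structure described in this subsection can be written in a few lines and compared against the belief-based definition of the myopic action. The following is a sketch under stated assumptions: channels are 0-based, the function names are hypothetical, initial beliefs are distinct so that no ties arise, and `observations[t-1]` is the state sensed in slot $t$. The helper `structure_matches_beliefs` enumerates every observation sequence of a given length and checks that both rules choose the same channels.

```python
from itertools import product
from typing import List, Sequence


def structural_actions(init_beliefs: Sequence[float], observations: Sequence[int],
                       p11: float, p01: float) -> List[int]:
    """Channel choices from the round-robin rule; uses only the order of p11 and p01."""
    n = len(init_beliefs)
    # Initial circular order: channels sorted by descending initial belief.
    order = sorted(range(n), key=lambda i: -init_beliefs[i])
    current = order[0]
    actions = [current]
    for t, obs in enumerate(observations, start=1):
        # obs was sensed in slot t; choose the channel for slot t + 1.
        if p11 > p01:
            stay, k = obs == 1, order
        else:
            stay = obs == 0
            # The circular order is reversed in even slots.
            k = order if (t + 1) % 2 == 1 else order[::-1]
        if not stay:
            current = k[(k.index(current) + 1) % n]
        actions.append(current)
    return actions


def belief_actions(init_beliefs: Sequence[float], observations: Sequence[int],
                   p11: float, p01: float) -> List[int]:
    """Channel choices obtained by tracking beliefs and taking the arg max."""
    beliefs = list(init_beliefs)
    current = max(range(len(beliefs)), key=lambda i: beliefs[i])
    actions = [current]
    for obs in observations:
        beliefs = [(p11 if obs == 1 else p01) if i == current
                   else w * p11 + (1.0 - w) * p01
                   for i, w in enumerate(beliefs)]
        current = max(range(len(beliefs)), key=lambda i: beliefs[i])
        actions.append(current)
    return actions


def structure_matches_beliefs(init_beliefs: Sequence[float], p11: float,
                              p01: float, horizon: int = 8) -> bool:
    """Compare both rules over all 2**horizon observation sequences."""
    return all(
        structural_actions(init_beliefs, obs, p11, p01)
        == belief_actions(init_beliefs, obs, p11, p01)
        for obs in product((0, 1), repeat=horizon))
```

Note that `structural_actions` never evaluates a belief after the first slot, which is the belief-independence property discussed next.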

B. Properties 

The simple structure of the myopic policy has significant
implications in both practical and technical aspects.
Implementation-wise, the following properties of the myopic
policy follow from its structure: belief-independence and
model-insensitivity. Specifically, the myopic policy does not
require the update of the belief vectors or the knowledge
of the transition probabilities except the order of $p_{11}$ and
$p_{01}$. These properties make the myopic policy particularly
attractive in implementation. Besides its simplicity, this semi-universal
structure leads to robustness against model mismatch
and variations.

Fig. 3. The structure of the myopic policy for $p_{11} < p_{01}$: in the first
slot ($t = 1$), the circular order $\mathcal{K}(1)$ is determined by the initial belief $\Omega(1)$
($\omega_1(1) \ge \omega_2(1) \ge \cdots \ge \omega_N(1)$ is assumed in this example, thus $\hat{a}(1) = 1$).
Suppose that channel 1 is in the bad state in slots $1, \cdots, L-2$ and in the
good state in slot $L-1$. The circular order at $t = L$ is $\mathcal{K}(1)$ when $L$ is
odd and $-\mathcal{K}(1)$ when $L$ is even, and $\hat{a}(L)$ is the next channel in $\mathcal{K}(L)$, i.e.,
$\hat{a}(L) = 2$ for $L$ odd and $\hat{a}(L) = N$ for $L$ even.

A technical benefit of this simple structure is that it provides
the foundation for establishing the optimality and characterizing
the performance of the myopic policy as given in Sec. V
and Sec. VI, as well as the generalizations of the optimality proof to
$N > 2$ given in [8]. The reason is that the structure allows
us to work with a Markov reward process with a finite state
space instead of one with an uncountable state space (i.e.,
belief vectors) as we encounter in a general POMDP. Details
are stated in the corollary below.

Corollary 1: Let $\mathcal{K}(t) = (n_1, n_2, \cdots, n_N)$ ($n_i \in
\{1, 2, \cdots, N\}\ \forall i$) be the circular order of channels in slot
$t$, where the starting point of the circular order is fixed to
the myopic action: $n_1 = \hat{a}(t)$ for all $t$. Then the resulting
ordered channel states $\vec{S}(t) = [S_{n_1}(t), S_{n_2}(t), \cdots, S_{n_N}(t)]$
form a $2^N$-state Markov chain with transition probabilities
$\{q_{\mathbf{i},\mathbf{j}}\}$ given in (8), and the performance of the myopic policy
is determined by the Markov reward process $(\vec{S}(t), R(t))$ with
$R(t) = S_{n_1}(t)$.

For $p_{11} > p_{01}$:

$$q_{\mathbf{i},\mathbf{j}} = \begin{cases} \prod_{k=1}^{N} p_{i_k, j_k}, & \text{if } i_1 = 1 \\ p_{i_1, j_N} \prod_{k=2}^{N} p_{i_k, j_{k-1}}, & \text{if } i_1 = 0 \end{cases}$$

For $p_{11} < p_{01}$:

$$q_{\mathbf{i},\mathbf{j}} = \begin{cases} \prod_{k=1}^{N} p_{i_k, j_{N-k+1}}, & \text{if } i_1 = 1 \\ p_{i_1, j_1} \prod_{k=2}^{N} p_{i_k, j_{N-k+2}}, & \text{if } i_1 = 0 \end{cases} \tag{8}$$

where $\mathbf{i} = [i_1, i_2, \cdots, i_N]$, $\mathbf{j} = [j_1, j_2, \cdots, j_N]$ with entries equal to 0 or 1.

Proof: The proof follows directly from Theorem 1 by
noticing that $S_{n_1}(t)$ determines the channel ordering in $\vec{S}(t +
1)$ and the channels evolve as independent Markov chains.
Specifically, for $p_{11} > p_{01}$, if $S_{n_1}(t) = 1$, the channel ordering
in $\vec{S}(t + 1)$ is the same as that in $\vec{S}(t)$; if $S_{n_1}(t) = 0$, the
first channel (channel $n_1$) in $\vec{S}(t)$ is moved to the last position
in $\vec{S}(t + 1)$ with the ordering of the remaining $N - 1$ channels
unchanged. For $p_{11} < p_{01}$, if $S_{n_1}(t) = 0$, the first channel in
$\vec{S}(t)$ remains the first in $\vec{S}(t + 1)$ while the ordering of the
remaining channels is reversed; if $S_{n_1}(t) = 1$, the ordering of all N
channels is reversed. The transition probabilities given in (8)
thus follow. ■
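The finite-state chain of Corollary 1 lends itself to direct numerical evaluation. The following sketch builds the $2^N \times 2^N$ transition matrix from the reordering rules stated in the proof (keep the order, rotate the first channel to the end, or reverse) and estimates the long-run fraction of slots in which the first ordered channel is good. The function names, the power-iteration approach and the iteration count are assumptions made for illustration; only the standard library is used.

```python
from itertools import product
from typing import List, Tuple


def ordered_state_chain(n: int, p11: float,
                        p01: float) -> Tuple[List[Tuple[int, ...]], List[List[float]]]:
    """States and transition matrix of the ordered channel-state chain."""
    p = [[1.0 - p01, p01], [1.0 - p11, p11]]  # p[x][y] = P(next = y | now = x)
    states = list(product((0, 1), repeat=n))
    matrix = []
    for i in states:
        # perm[k] = position in the next slot of the channel now at position k.
        if p11 > p01:
            # Good: order unchanged. Bad: first channel moves to the end.
            perm = list(range(n)) if i[0] == 1 else [n - 1] + list(range(n - 1))
        elif i[0] == 1:
            # Good: the whole order is reversed.
            perm = [n - 1 - k for k in range(n)]
        else:
            # Bad: first channel stays first, the rest are reversed.
            perm = [0] + [n - k for k in range(1, n)]
        row = []
        for j in states:
            q = 1.0
            for k in range(n):
                q *= p[i[k]][j[perm[k]]]
            row.append(q)
        matrix.append(row)
    return states, matrix


def steady_state_throughput(n: int, p11: float, p01: float,
                            iterations: int = 2000) -> float:
    """Limiting probability that the first ordered channel is in state 1."""
    states, matrix = ordered_state_chain(n, p11, p01)
    size = len(states)
    dist = [1.0 / size] * size
    for _ in range(iterations):  # power iteration on the row vector
        dist = [sum(dist[r] * matrix[r][c] for r in range(size)) for c in range(size)]
    return sum(prob for s, prob in zip(states, dist) if s[0] == 1)
```

For two channels with $p_{11} = 0.8$, $p_{01} = 0.2$, solving the four balance equations by hand gives stationary probabilities 0.25, 0.10, 0.40, 0.25 for the ordered states (0,0), (0,1), (1,0), (1,1), so the two states with a good first channel sum to 0.65.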

V. Optimality of Myopic Sensing 

In this section, we establish the optimality of the myopic 
policy for N = 2. Our proof hinges on the structure of the 
myopic policy given in Theorem 1 and Corollary 1.

Theorem 2: Optimality of Myopic Sensing:
For $N = 2$, the myopic sensing policy is optimal, i.e., $\hat{V}_t(\Omega) =
V_t(\Omega)$ for all $t$ and $\Omega$.

Proof: see Appendix C. ■ 

Based on extensive numerical results, we conjecture that the 
optimality of the myopic sensing policy can be generalized to 
N > 2. A recent work [8] has made partial progress towards 
proving this conjecture, by showing that the optimality holds 
for N > 2 under the condition of p\\ > poi- Furthermore, it 
is shown in [8] that if the myopic policy is optimal under the 
sum-reward criterion over a finite horizon, it is also optimal 
for other criteria such as discounted and averaged rewards 
over a finite or infinite horizon. In the case of infinite-horizon 
discounted reward, it is determined that so long as the discount 
factor is less than 0.5, the myopic policy is optimal for all N. 

VI. Performance of Myopic Sensing 

In this section, we analyze the performance of the myopic 
policy. With the optimality results, the throughput achieved by 
the myopic policy defines the performance limit of a multi- 
channel opportunistic communications system. In particular, 
we are interested in the relationship between this maximum 
throughput and the number N of channels. 

A. Uniqueness of Steady-State Performance and Its Numerical 
Evaluation 

We first establish the existence and uniqueness of the 
system steady states under the myopic policy. The steady-state 
throughput of the myopic policy is given by

$$U(\Omega(1)) = \lim_{T \to \infty} \frac{\hat{V}_{1:T}(\Omega(1))}{T}, \tag{9}$$

where $\hat{V}_{1:T}(\Omega(1))$ is the expected total reward obtained in
T slots under the myopic policy when the initial belief is
$\Omega(1)$. From Corollary 1, $U(\Omega(1))$ is determined by the Markov
reward process $\{\vec{S}(t), R(t)\}$. It is easy to see that the $2^N$-state
Markov chain $\{\vec{S}(t)\}$ is irreducible and aperiodic, thus has a
limiting distribution. As a consequence, the limit in (9) exists,
and the steady-state throughput $U$ is independent of the initial
belief value $\Omega(1)$.

Corollary 1 also provides a numerical approach to evaluating
$U$ by calculating the limiting (stationary) distribution
of $\{\vec{S}(t)\}$ whose transition probabilities are given in (8).
Specifically, the throughput $U$ is given by the summation of
the limiting probabilities of those $2^{N-1}$ states with first entry
equal to 1. This numerical approach, however, does not provide
an analytical characterization of the throughput $U$ in terms
of the number N of channels and the transition probabilities
$\{p_{ij}\}$. In the next subsection, we obtain analytical expressions
of U and its scaling behavior with respect to N based on a 
stochastic dominance argument. 

B. Analytical Characterization of Throughput 

1) The Structure of Transmission Period: From the structure
of the myopic policy we can see that the key to the
throughput is how often the user switches channels, or equivalently,
how long the user stays in the same channel. When
$p_{11} > p_{01}$, the event of channel switching is equivalent to a
slot without reward. The opposite holds when $p_{11} < p_{01}$: a
channel switch corresponds to a slot with reward.

We thus introduce the concept of transmission period (TP),
which is the time the user stays in the same channel (see
Fig. 4). Let $L_k$ denote the length of the $k$-th TP. We then have
a discrete-time random process $\{L_k\}_{k=1}^{\infty}$ with a state space of
positive integers.

Fig. 4. The transmission period structure (TPs are delimited by channel switching; the example shows $L_k = 3$ followed by $L_{k+1}$).

Based on the structure of the myopic policy, we have

$$U = \begin{cases} \lim_{K \to \infty} \dfrac{\sum_{k=1}^{K} (L_k - 1)}{\sum_{k=1}^{K} L_k}, & p_{11} > p_{01} \\ \lim_{K \to \infty} \dfrac{K}{\sum_{k=1}^{K} L_k}, & p_{11} < p_{01}. \end{cases} \tag{10}$$

Let $\bar{L} = \lim_{K \to \infty} \frac{\sum_{k=1}^{K} L_k}{K}$ denote the average length of a TP.
The above equation leads to

$$U = \begin{cases} 1 - 1/\bar{L}, & p_{11} > p_{01} \\ 1/\bar{L}, & p_{11} < p_{01}. \end{cases} \tag{11}$$

Throughput analysis is thus reduced to analyzing the average
TP length $\bar{L}$. For $N = 2$, a closed-form expression of $\bar{L}$ can
be obtained, which leads to a closed-form expression of the
throughput $U$ (see Sec. VI-B.2). For $N > 2$, lower and upper
bounds on $U$ are obtained (see Sec. VI-B.3).
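The relation between throughput and average TP length can be illustrated by simulation. The following Monte Carlo sketch runs the round-robin policy on independent two-state channels and returns two estimates: the fraction of slots with reward, and the value $1 - 1/\bar{L}$ (or $1/\bar{L}$) computed from the completed TPs. The function name, the seed handling and the choice of drawing initial states from the stationary distribution are assumptions made for this illustration.

```python
import random
from typing import Tuple


def simulate_myopic(n: int, p11: float, p01: float, slots: int,
                    seed: int = 0) -> Tuple[float, float]:
    """Return (reward rate, throughput implied by the mean TP length)."""
    rng = random.Random(seed)
    w0 = p01 / (p01 + 1.0 - p11)  # stationary probability of the good state
    states = [1 if rng.random() < w0 else 0 for _ in range(n)]
    order = list(range(n))  # equal initial beliefs: any order is a descending order
    current, reward, run = 0, 0, 0
    tp_lengths = []
    for t in range(1, slots + 1):
        obs = states[current]
        reward += obs
        run += 1
        stay = (obs == 1) if p11 > p01 else (obs == 0)
        if not stay:
            tp_lengths.append(run)  # a switch closes the current TP
            run = 0
            # For p11 < p01 the circular order is reversed in even slots.
            k = order if (p11 > p01 or (t + 1) % 2 == 1) else order[::-1]
            current = k[(k.index(current) + 1) % n]
        # All channels evolve independently to the next slot.
        states = [1 if rng.random() < (p11 if s == 1 else p01) else 0 for s in states]
    mean_tp = sum(tp_lengths) / len(tp_lengths)
    from_tp = 1.0 - 1.0 / mean_tp if p11 > p01 else 1.0 / mean_tp
    return reward / slots, from_tp
```

The two estimates differ only through the last, incomplete TP, so they agree closely for long runs.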

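2) Throughput for $N = 2$: From the structure of the
myopic policy, $\{L_k\}_{k=1}^{\infty}$ forms a first-order Markov chain for
$N = 2$. Specifically, the distribution of $L_k$ is determined by
the belief value of the chosen channel in the first slot of the
$k$-th TP. The latter equals $p_{01}^{(L_{k-1}+1)}$ for $p_{11} > p_{01}$ and
$p_{11}^{(L_{k-1}+1)}$ for $p_{11} < p_{01}$, where $p_{mn}^{(j)}$ is the $j$-step transition
probability from state $m$ to state $n$. The transition probabilities of $\{L_k\}_{k=1}^{\infty}$ are thus
given as follows.

• For $p_{11} > p_{01}$,

$$\Pr[L_k = j \mid L_{k-1} = i] = \begin{cases} 1 - p_{01}^{(i+1)}, & i \ge 1,\ j = 1 \\ p_{01}^{(i+1)}\, p_{11}^{\,j-2}\, p_{10}, & i \ge 1,\ j \ge 2. \end{cases} \tag{12}$$

• For $p_{11} < p_{01}$,

$$\Pr[L_k = j \mid L_{k-1} = i] = \begin{cases} p_{11}^{(i+1)}, & i \ge 1,\ j = 1 \\ p_{10}^{(i+1)}\, p_{00}^{\,j-2}\, p_{01}, & i \ge 1,\ j \ge 2. \end{cases} \tag{13}$$

As shown in Appendix D, the limiting distribution $\{\lambda_l\}_{l=1}^{\infty}$
of this countable-state Markov chain can be obtained in closed
form, which leads to $\bar{L} = \sum_{l=1}^{\infty} l \lambda_l$ and then the throughput
$U$ from (11).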

Theorem 3: For $N = 2$, the throughput $U$ is given by

$$U = \begin{cases} 1 - \dfrac{1 - p_{11}}{1 + \bar{\omega} - p_{11}}, & p_{11} > p_{01} \\ \dfrac{p_{01}}{1 - \bar{\omega}' + p_{01}}, & p_{11} < p_{01}, \end{cases} \tag{14}$$

where $\bar{\omega}$ and $\bar{\omega}'$ are the expected probability that the channel
the user switches to is in state 1 when $p_{11} > p_{01}$ and $p_{11} <
p_{01}$, respectively. They are given in (15) and (16).

$$\bar{\omega} = \frac{p_{01}^{(2)}}{1 + p_{01}^{(2)} - A}, \text{ where } p_{01}^{(2)} = p_{00} p_{01} + p_{01} p_{11},\ A = \frac{p_{01}}{1 + p_{01} - p_{11}} \left(1 - \frac{(p_{11} - p_{01})^3 (1 - p_{11})}{1 - p_{11}^2 + p_{11} p_{01}}\right), \tag{15}$$

$$\bar{\omega}' = \frac{B}{1 - p_{11}^{(2)} + B}, \text{ where } p_{11}^{(2)} = p_{10} p_{01} + p_{11} p_{11},\ B = \frac{p_{01}}{1 + p_{01} - p_{11}} \left(1 + \frac{(p_{11} - p_{01})^3 (1 - p_{11})}{1 - (1 - p_{01})(p_{11} - p_{01})}\right). \tag{16}$$

Proof: See Appendix D. ■

3) Throughput for $N > 2$: For $N > 2$, $\{L_k\}_{k=1}^{\infty}$ is a
random process with higher-order memory. In particular, for
$p_{11} > p_{01}$, it is an $(N-1)$-th order Markov chain. As a
consequence, closed-form expressions of $\bar{L}$ are difficult to
obtain. Our objective is to develop lower and upper bounds
on $U$, which would allow us to study the scaling behavior of
$U$ with respect to $N$.

The approach is to construct first-order Markov chains that
stochastically dominate or are dominated by $\{L_k\}_{k=1}^{\infty}$. The
stationary distributions of these first-order Markov chains,
which can be obtained in closed form, lead to lower and upper
bounds on $U$ according to (11). Specifically, for $p_{11} > p_{01}$,
a lower bound on $U$ is obtained by constructing a first-order
Markov chain whose stationary distribution is stochastically
dominated by the stationary distribution of $\{L_k\}_{k=1}^{\infty}$. An upper
bound on $U$ is given by a first-order Markov chain whose
stationary distribution stochastically dominates the stationary
distribution of $\{L_k\}_{k=1}^{\infty}$. Similarly, bounds on $U$ can be
obtained for $p_{11} < p_{01}$.

Theorem 4: For $N > 2$, we have the following lower and
upper bounds on the throughput $U$.

• Case 1: $p_{11} > p_{01}$

$$\frac{C}{C + (1 - D + C)(1 - p_{11})} \le U \le \frac{\omega_o}{1 - p_{11} + \omega_o}, \tag{17}$$

where $\omega_o$ is given by (3) and

$$C = \omega_o \left(1 - (p_{11} - p_{01})^N\right), \qquad D = \omega_o \left(1 - \frac{(p_{11} - p_{01})^{N+1} (1 - p_{11})}{1 - p_{11}^2 + p_{11} p_{01}}\right).$$

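• Case 2: $p_{11} < p_{01}$

$$1 - \frac{p_{10}^{(2)}}{E - p_{01}H} \le U \le 1 - \frac{p_{10}^{(2)}}{E - p_{01}G} \tag{18}$$

where

$$p_{10}^{(2)} = p_{10}p_{00} + p_{11}p_{10},$$

$$E = p_{10}^{(2)}(1 + p_{01}) + p_{01}(1 - F),$$

$$F = (1 - p_{01})(1 - \omega_o) \left( \frac{1}{2 - p_{01}} - \frac{p_{01}(p_{11} - p_{01})^4}{1 - (p_{11} - p_{01})^2 (1 - p_{01})^2} \right),$$

$$G = (1 - \omega_o) \left( \frac{1}{2 - p_{01}} - \frac{p_{01}(p_{11} - p_{01})^6}{1 - (p_{11} - p_{01})^2 (1 - p_{01})^2} \right),$$

$$H = (1 - \omega_o) \left( \frac{1}{2 - p_{01}} - \frac{p_{01}(p_{11} - p_{01})^{2N-1}}{1 - (p_{11} - p_{01})^2 (1 - p_{01})^2} \right).$$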



• Monotonicity: in both cases, the upper bound is independent of N while the lower bound monotonically approaches the upper bound as N increases; for $p_{11} \ge p_{01}$, the lower bound converges to the upper bound as $N \to \infty$.

Proof: See Appendix E. ■ 
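The bounds of Theorem 4 are straightforward to evaluate numerically. The sketch below is one possible Python transcription of (17) and (18); the function name `throughput_bounds` and its arguments are hypothetical, and the code assumes $0 < p_{01}, p_{11} < 1$ and an integer number of channels `n`.

```python
def throughput_bounds(p01, p11, n):
    """Return (lower, upper) bounds on U for n channels, following (17)-(18).

    Hypothetical helper for illustration; assumes 0 < p01, p11 < 1.
    """
    d = p11 - p01
    w_o = p01 / (1.0 + p01 - p11)          # stationary probability of state 1
    if p11 >= p01:
        # Case 1, equation (17)
        C = w_o * (1.0 - d**n)
        D = w_o * (1.0 - d**(n + 1) * (1.0 - p11) / (1.0 - p11**2 + p11 * p01))
        lower = C / (C + (1.0 - D + C) * (1.0 - p11))
        upper = w_o / (1.0 - p11 + w_o)
        return lower, upper
    # Case 2, equation (18)
    p00, p10 = 1.0 - p01, 1.0 - p11
    p10_2 = p10 * p00 + p11 * p10          # two-step probability from state 1 to state 0
    denom = 1.0 - d**2 * (1.0 - p01)**2
    F = (1.0 - p01) * (1.0 - w_o) * (1.0 / (2.0 - p01) - p01 * d**4 / denom)
    G = (1.0 - w_o) * (1.0 / (2.0 - p01) - p01 * d**6 / denom)
    H = (1.0 - w_o) * (1.0 / (2.0 - p01) - p01 * d**(2 * n - 1) / denom)
    E = p10_2 * (1.0 + p01) + p01 * (1.0 - F)
    return 1.0 - p10_2 / (E - p01 * H), 1.0 - p10_2 / (E - p01 * G)
```

Evaluating the function for increasing `n` with fixed transition probabilities shows the behavior stated in the monotonicity property: the upper bound does not change with `n`, while the lower bound increases.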
Numerical results given in [6] have demonstrated the tightness of the bounds: the relative difference between the lower and the upper bounds is within 6% for a wide range of transition probabilities $\{p_{ij}\}$.

The monotonicity of the difference between the upper and lower bounds with respect to N shows that the performance of the multi-channel opportunistic system improves with the number N of channels, as suggested by intuition. For $p_{11} \ge p_{01}$, the upper bound gives the limiting performance of the opportunistic system when $N \to \infty$. In Corollary 2 below, we show that the throughput of an opportunistic system increases to a constant at (at least) a geometric rate as N increases.
This result conveys an important message regarding system 
design: the throughput of a multi-channel opportunistic system 
with single-channel sensing quickly saturates as the number 
of channels increases; it is thus crucial to enhance radio 
sensing capability in order to fully exploit the communication 
opportunities offered by a large number of channels. 

Corollary 2: For $p_{11} \ge p_{01}$, the lower bound on the throughput U converges to the constant upper bound at geometric rate $(p_{11} - p_{01})$ as N increases; for $p_{11} < p_{01}$, the lower bound on U converges to a constant at geometric rate $(p_{01} - p_{11})^2$.

Proof: See Appendix F. ■ 

VII. Conclusion and Future Work 

We have considered an optimal sensing problem that is 
of fundamental interest in contexts involving opportunistic 
communications over multiple channels. We have shown that 
for independent and identically evolving channels, the myopic 
sensing policy has a simple round-robin structure, which 
obviates the need to know the exact channel parameters, 
making it extremely easy to implement in practice. We have 
proved that the myopic policy is optimal for the two-channel 
case. We have also characterized in closed-form the throughput 
performance of the myopic policy and the scaling behavior 
with respect to the number of channels. 

Future directions include sensing policies for non-identical 
channels and with multi-channel sensing. In a recent work 
[21], the existence of Whittle's index policy and the closed- 
form expression of Whittle's index have been obtained, lead- 
ing to a simple, near-optimal index policy for non-identical 
channels with multi-channel sensing. Furthermore, it is shown 
in [21] that the myopic policy is equivalent to Whittle's index 
policy when channels are identical. The results obtained in this 
paper on the myopic policy thus also apply to Whittle's index 
policy. The structure and optimality of the myopic policy are
also extended to multichannel sensing in [22].

It is also of interest to consider sensing policies for multiple 
users competing for communication opportunities in multiple 
channels. Recent work on extending the myopic sensing policy 
to multi-user scenarios can be found in [23], [24]. 

Appendix A: Proof of Theorem 1

We prove Theorem 1 by showing that the channel $a(t)$ given by (6) and (7) is indeed the channel with the largest belief value in slot t. Specifically, we prove the following lemma.

Lemma 1: Let $a(t) = i_1$ be the channel determined by (6) for $p_{11} \ge p_{01}$ and by (7) for $p_{11} < p_{01}$. Let $\mathcal{K}(t) = (i_1, i_2, \cdots, i_N)$ be the circular order of channels in slot t, where we set the starting point to $a(t) = i_1$. We then have, for any $t \ge 1$,

$$\omega_{i_1}(t) \ge \omega_{i_2}(t) \ge \cdots \ge \omega_{i_N}(t), \tag{19}$$

i.e., the channel given by (6) and (7) has the largest belief value in every slot t.

To prove Lemma 1, we introduce the operator $\tau(\cdot)$ for the belief update of unobserved channels (see the belief update equation):

$$\tau(\omega) = \omega p_{11} + (1 - \omega)p_{01} = p_{01} + \omega(p_{11} - p_{01}). \tag{20}$$

Note that $\tau(\omega)$ is an increasing function of $\omega$ for $p_{11} > p_{01}$ and a decreasing function of $\omega$ for $p_{11} < p_{01}$. Furthermore, we note that the belief value $\omega_i(t)$ of channel i in slot t is bounded between $p_{01}$ and $p_{11}$ for any i and $t > 1$, and an observed channel achieves either the upper bound or the lower bound of the belief values (see the belief update equation).

We now prove Lemma 1 by induction. For $t = 1$, (19) holds by the definition of $\mathcal{K}(1)$. Assume that (19) is true for slot t, where $\mathcal{K}(t) = (i_1, i_2, \cdots, i_N)$ and $a(t) = i_1$. We show that it is also true for slot $t + 1$.

Consider first $p_{11} \ge p_{01}$. We have $\mathcal{K}(t+1) = \mathcal{K}(t) = (i_1, i_2, \cdots, i_N)$. When $S_{i_1}(t) = 1$, we have $a(t+1) = a(t) = i_1$ from (6). Since $\omega_{i_1}(t+1) = p_{11}$ achieves the upper bound of the belief values and the order of the belief values of the unobserved channels remains unchanged due to the monotonically increasing property of $\tau(\omega)$, we arrive at (19) for $t + 1$. When $S_{i_1}(t) = 0$, we have $a(t+1) = i_2$ from (6). We again have (19) by noticing that $\omega_{i_1}(t+1) = p_{01}$ achieves the lower bound of the belief values and $\mathcal{K}(t+1) = (i_2, i_3, \cdots, i_N, i_1)$ when the starting point is set to $a(t+1) = i_2$.

For $p_{11} < p_{01}$, $\mathcal{K}(t+1) = -\mathcal{K}(t) = (i_1, i_N, i_{N-1}, \cdots, i_2)$. When $S_{i_1}(t) = 0$, we have $a(t+1) = a(t) = i_1$ from (7). Since $\omega_{i_1}(t+1) = p_{01}$ achieves the upper bound of the belief values and the order of the belief values of the unobserved channels is reversed due to the monotonically decreasing property of $\tau(\omega)$, we have, from the induction assumption at t,

$$\omega_{i_1}(t+1) \ge \omega_{i_N}(t+1) \ge \omega_{i_{N-1}}(t+1) \ge \cdots \ge \omega_{i_2}(t+1),$$

which agrees with (19) for $t + 1$ and $\mathcal{K}(t+1) = (i_1, i_N, i_{N-1}, \cdots, i_2)$. When $S_{i_1}(t) = 1$, we have $a(t+1) = i_N$ from (7). We again have (19) by noticing that $\omega_{i_1}(t+1) = p_{11}$ achieves the lower bound of the belief values and $\mathcal{K}(t+1) = (i_N, i_{N-1}, \cdots, i_2, i_1)$ when the starting point is set to $a(t+1) = i_N$. This concludes the proof of Lemma 1, hence Theorem 1.
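The induction argument can also be checked by simulation. The following Python sketch runs the round-robin rule described in the proof of Lemma 1 alongside the exact belief update and verifies, in every slot, that the beliefs are non-increasing along the circular order, so that the selected channel has the largest belief. The function `simulate_structure`, its arguments, and the tolerance are assumptions made for this illustration.

```python
import random

def simulate_structure(p01, p11, beliefs, slots, seed=0):
    """Check (19) slot by slot for the round-robin rule; returns True if it always holds.

    Hypothetical helper for illustration; `beliefs` are initial beliefs in [0, 1].
    """
    rng = random.Random(seed)
    n = len(beliefs)
    w = list(beliefs)
    # Circular order K(1): channels sorted by decreasing initial belief.
    order = sorted(range(n), key=lambda i: -w[i])
    # Hidden channel states, drawn according to the initial beliefs.
    state = [1 if rng.random() < w[i] else 0 for i in range(n)]
    for _ in range(slots):
        # Lemma 1: beliefs are non-increasing along the current circular order.
        if any(w[order[k]] < w[order[k + 1]] - 1e-12 for k in range(n - 1)):
            return False
        a = order[0]              # channel selected by the round-robin rule
        obs = state[a]
        # Belief update: the observed channel resets to p11 or p01,
        # every other channel goes through tau in (20).
        w = [
            (p11 if obs else p01) if i == a else p01 + w[i] * (p11 - p01)
            for i in range(n)
        ]
        if p11 >= p01:
            # Stay after observing 1; otherwise move to the next channel.
            if obs == 0:
                order = order[1:] + order[:1]
        else:
            # The order of the unobserved channels is reversed in every slot;
            # stay after observing 0, otherwise the observed channel goes last.
            rest = order[:0:-1]
            order = [a] + rest if obs == 0 else rest + [a]
        # Channels evolve independently as Gilbert-Elliot chains.
        state = [1 if rng.random() < (p11 if s else p01) else 0 for s in state]
    return True
```

Note that the rule itself never uses the numerical values of $p_{01}$ and $p_{11}$, only their order; the transition probabilities enter the sketch solely to compute the beliefs being checked and to drive the simulated channels.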

Appendix B: Last Channel Visits and j-Step Transition Probabilities

As commented in Sec. IV, another way to see the channel switching structure of the myopic policy is through the last visit to each channel once every channel has been visited at least once. An alternative proof of this structure is based on properties of the j-step transition probabilities $p_{01}^{(j)}$ and $p_{11}^{(j)}$ [25]:

$$p_{01}^{(j)} = \frac{p_{01} - p_{01}(p_{11} - p_{01})^j}{p_{01} + p_{10}}, \tag{21}$$

$$p_{11}^{(j)} = \frac{p_{01} + p_{10}(p_{11} - p_{01})^j}{p_{01} + p_{10}}. \tag{22}$$

It is easy to see that for $p_{11} > p_{01}$, $p_{01}^{(j)}$ monotonically increases to the stationary distribution $\omega_o$ as j increases. For $p_{11} < p_{01}$, $p_{11}^{(j)}$ oscillates around and converges to $\omega_o$ with $p_{11}^{(j)} > \omega_o$ for even j's and $p_{11}^{(j)} < \omega_o$ for odd j's (see Fig. 5 and 6). The channel switching structure thus follows by noticing that channel switching occurs only after observing 0 for $p_{11} > p_{01}$ and after observing 1 for $p_{11} < p_{01}$.
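The closed forms (21) and (22) and the two properties used above can be verified numerically. The Python sketch below compares the closed forms with a direct iteration of the one-step update; the function names are hypothetical and chosen only for this illustration.

```python
def p01_j(p01, p11, j):
    """Closed form (21): probability of state 1 after j steps, starting from state 0."""
    p10 = 1.0 - p11
    return (p01 - p01 * (p11 - p01) ** j) / (p01 + p10)

def p11_j(p01, p11, j):
    """Closed form (22): probability of state 1 after j steps, starting from state 1."""
    p10 = 1.0 - p11
    return (p01 + p10 * (p11 - p01) ** j) / (p01 + p10)

def j_step_by_recursion(p01, p11, start, j):
    """Same probability obtained by iterating the one-step update j times."""
    w = float(start)              # start = 0 or 1 is the known initial state
    for _ in range(j):
        w = w * p11 + (1.0 - w) * p01
    return w
```

With $p_{01} = 0.2$ and $p_{11} = 0.8$, the values `p01_j(0.2, 0.8, j)` increase with `j` toward $\omega_o = 0.5$; with $p_{01} = 0.8$ and $p_{11} = 0.2$, the values `p11_j(0.8, 0.2, j)` lie above 0.5 for even `j` and below 0.5 for odd `j`.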




Fig. 5. The j-step transition probabilities of the Gilbert-Elliot channel when $p_{11} > p_{01}$.

Fig. 6. The j-step transition probabilities of the Gilbert-Elliot channel when $p_{11} < p_{01}$.



Appendix C: Proof of Theorem 2

Recall that $\hat{V}_t(\Omega)$ denotes the total expected reward obtained under the myopic policy starting from slot t. Let $\hat{V}_t(\Omega; a)$ denote the total expected reward obtained by action a in slot t followed by the myopic policy in future slots. We first establish the following lemma, which applies to a general POMDP/MDP.

Lemma 2: For a POMDP over a finite horizon T, the myopic policy is optimal if for $t = 1, \cdots, T$,

$$\hat{V}_t(\Omega) \ge \hat{V}_t(\Omega; a), \quad \forall a, \Omega. \tag{25}$$

Lemma 2 can be proved by backward induction, with $V_t(\Omega)$ the value function of the optimal policy. Specifically, the initial condition $V_T(\Omega) = \hat{V}_T(\Omega)$ is straightforward. Assume that $V_{t+1}(\Omega) = \hat{V}_{t+1}(\Omega)$. We then have, from (25),

$$V_t(\Omega) = \max_a \Big\{ R_a(\Omega) + \sum_{\Omega'} \Pr[\Omega' \mid \Omega, a]\, V_{t+1}(\Omega') \Big\} = \max_a \Big\{ R_a(\Omega) + \sum_{\Omega'} \Pr[\Omega' \mid \Omega, a]\, \hat{V}_{t+1}(\Omega') \Big\} = \hat{V}_t(\Omega),$$

i.e., the myopic policy is optimal.

We now prove Theorem 2 based on Corollary 1. Considering all channel state realizations in slot t, we have

$$\hat{V}_t(\Omega; a) = \sum_{\mathbf{s}} \Pr[\mathbf{S}(t) = \mathbf{s} \mid \Omega]\, \hat{V}_t(\Omega; a \mid \mathbf{S}(t) = \mathbf{s}) = \omega_a + \sum_{\mathbf{s}} \Pr[\mathbf{S}(t) = \mathbf{s} \mid \Omega]\, \hat{V}_{t+1}(\mathcal{T}(\Omega \mid a, s_a) \mid \mathbf{S}(t) = \mathbf{s}), \tag{26}$$

where $\hat{V}_{t+1}(\mathcal{T}(\Omega \mid a, s_a) \mid \mathbf{S}(t) = \mathbf{s})$ is the conditional reward obtained starting from slot $t + 1$ given that the system state in slot t is $\mathbf{s}$. From Corollary 1, we have

$$\hat{V}_t(\mathcal{T}(\Omega \mid a, s_a) \mid \mathbf{S}(t-1) = \mathbf{s}) = \hat{V}_t(\mathcal{T}(\Omega' \mid a, s_a) \mid \mathbf{S}(t-1) = \mathbf{s}), \tag{27}$$

i.e., the conditional total expected reward of the myopic policy starting from slot t is determined by the action a in slot $t - 1$ and independent of the belief vector $\Omega$ in slot $t - 1$ (note that $a(t-1)$ and $\mathbf{S}(t-1)$ determine $\mathbf{S}(t)$, which determines the reward process). Adopting the simplified notation of $\hat{V}_t(a(t-1) \mid \mathbf{S}(t-1) = \mathbf{s})$, we further have, from the statistically identical assumption of channels,

$$\hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [s_1, s_2]) = \hat{V}_t(a(t-1) = 2 \mid \mathbf{S}(t-1) = [s_2, s_1]). \tag{28}$$

Next we show that

$$\hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [1, 0]) = \hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [0, 1]). \tag{29}$$

Assume that $p_{01} > p_{11}$. Following the structure of the myopic policy, we know that the myopic action in slot t is $a(t) = 2$ for the left-hand side of (29) and $a(t) = 1$ for the right, which leads to (23) and (24):

$$\hat{V}_t(1 \mid [1,0]) = p_{01} + p_{10}p_{00}\hat{V}_{t+1}(2 \mid [0,0]) + p_{10}p_{01}\hat{V}_{t+1}(2 \mid [0,1]) + p_{11}p_{00}\hat{V}_{t+1}(2 \mid [1,0]) + p_{11}p_{01}\hat{V}_{t+1}(2 \mid [1,1]), \tag{23}$$

$$\hat{V}_t(1 \mid [0,1]) = p_{01} + p_{00}p_{10}\hat{V}_{t+1}(1 \mid [0,0]) + p_{00}p_{11}\hat{V}_{t+1}(1 \mid [0,1]) + p_{01}p_{10}\hat{V}_{t+1}(1 \mid [1,0]) + p_{01}p_{11}\hat{V}_{t+1}(1 \mid [1,1]). \tag{24}$$

We then have (29) based on (28). The case of $p_{01} \le p_{11}$ can be similarly proved.

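Consider $\Omega = [\omega_1, \omega_2]$ with $\omega_1 \ge \omega_2$. The myopic action is thus $a = 1$. We now establish (25). From (26) and (28), we have

$$\hat{V}_t(\Omega; a = 1) = \omega_1 + \sum_{i,j \in \{0,1\}} \Pr[\mathbf{S}(t) = [i,j]]\, \hat{V}_{t+1}(1 \mid [i,j]),$$

$$\hat{V}_t(\Omega; a = 2) = \omega_2 + \sum_{i,j \in \{0,1\}} \Pr[\mathbf{S}(t) = [i,j]]\, \hat{V}_{t+1}(1 \mid [j,i]).$$

The terms with $i = j$ cancel in the difference, and $\Pr[\mathbf{S}(t) = [1,0]] - \Pr[\mathbf{S}(t) = [0,1]] = \omega_1 - \omega_2$. It thus follows from (29) that

$$\hat{V}_t(\Omega; a = 1) - \hat{V}_t(\Omega; a = 2) = (\omega_1 - \omega_2)\big(1 + \hat{V}_{t+1}(1 \mid [1,0]) - \hat{V}_{t+1}(1 \mid [0,1])\big) = \omega_1 - \omega_2 \ge 0.$$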

This concludes the proof. 
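For small horizons, the optimality of the myopic policy for N = 2 can be checked by exhaustive recursion over observations. The Python sketch below computes the expected total reward of the optimal policy and of the myopic policy from the same initial belief, using the belief update with $\tau(\cdot)$ from (20). The function names, the unit reward for sensing a channel in state 1, and the chosen horizon are assumptions of this illustration.

```python
def update(w, a, obs, p01, p11):
    """Belief update: the sensed channel resets to p11 or p01, the others go through tau."""
    return tuple(
        (p11 if obs else p01) if i == a else p01 + w[i] * (p11 - p01)
        for i in range(len(w))
    )

def value(w, t, horizon, p01, p11, myopic):
    """Expected total reward from slot t to the horizon, by exhaustive recursion.

    With myopic=True only the channel with the largest belief is sensed;
    with myopic=False the best action is chosen in every slot (optimal policy).
    """
    if t > horizon:
        return 0.0
    if myopic:
        actions = [max(range(len(w)), key=lambda i: w[i])]
    else:
        actions = range(len(w))
    best = float("-inf")
    for a in actions:
        v = w[a]  # expected immediate reward of sensing channel a
        v += w[a] * value(update(w, a, 1, p01, p11), t + 1, horizon, p01, p11, myopic)
        v += (1.0 - w[a]) * value(update(w, a, 0, p01, p11), t + 1, horizon, p01, p11, myopic)
        best = max(best, v)
    return best
```

For two channels, `value(w, 1, T, p01, p11, myopic=False)` and `value(w, 1, T, p01, p11, myopic=True)` should agree up to rounding for any initial belief `w`, in line with Theorem 2. The recursion visits on the order of $4^T$ belief states for the optimal policy, so it is only practical for short horizons.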

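Appendix D: Proof of Theorem 3

Consider first $p_{11} \ge p_{01}$. Let $\mathbf{R} = \{r_{i,j}\}$ denote the transition matrix of $\{L_k\}_{k=1}^{\infty}$, where $r_{i,j}$ is given in (12). Let $\mathbf{R}(:,k)$ denote the k-th column of $\mathbf{R}$. We have

$$\mathbf{1} - \mathbf{R}(:,1) = \frac{\mathbf{R}(:,2)}{p_{10}}, \qquad \mathbf{R}(:,k) = \mathbf{R}(:,2)\,(p_{11})^{k-2}, \tag{30}$$

where $\mathbf{1}$ is the unit column vector $[1, 1, \ldots]^T$. By the definition of stationary distribution, we have, for $k = 1, 2, \cdots$,

$$[\lambda_1, \lambda_2, \cdots]\, \mathbf{R}(:,k) = \lambda_k, \tag{31}$$

which, combined with (30), leads to

$$\lambda_1 = 1 - \frac{\lambda_2}{p_{10}}, \qquad \lambda_k = \lambda_2\, p_{11}^{\,k-2}. \tag{32}$$

Substituting (32) into (31) for $k = 2$ and solving for $\lambda_2$, we have $\lambda_2 = \bar{\omega}p_{10}$, where $\bar{\omega}$ is given in (15). From (32), we then have the stationary distribution as

$$\lambda_k = \begin{cases} 1 - \bar{\omega}, & k = 1 \\ \bar{\omega}\, p_{11}^{\,k-2}\, p_{10}, & k > 1 \end{cases} \tag{33}$$

which leads to (14) based on (11) and $\bar{L} = \sum_{k=1}^{\infty} k\lambda_k$. The proof for $p_{11} < p_{01}$ is similar based on the transition probabilities given in (13).

Based on Corollary 1, Theorem 3 can also be proved by calculating the stationary distribution of $\{\mathbf{S}(t)\}$.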
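The closed-form stationary distribution (33) can be compared with a direct numerical computation. The Python sketch below builds a truncated version of the transition matrix in (12) for the case $p_{11} \ge p_{01}$, obtains its stationary distribution by power iteration, and evaluates $\bar{\omega}$ from (15). The truncation size, the number of sweeps, and the function names are assumptions of this illustration; the probability mass beyond the truncation point is folded into the last state.

```python
def omega_bar(p01, p11):
    """Evaluate (15) for the case p11 >= p01."""
    p00 = 1.0 - p01
    d = p11 - p01
    w_o = p01 / (1.0 + p01 - p11)
    p01_2 = p00 * p01 + p01 * p11
    A = w_o * (1.0 - d**3 * (1.0 - p11) / (1.0 - p11**2 + p11 * p01))
    return p01_2 / (1.0 + p01_2 - A)

def stationary_tp_lengths(p01, p11, size=80, sweeps=50):
    """Stationary distribution of the TP-length chain (12), truncated to `size` states."""
    p10, d = 1.0 - p11, p11 - p01
    w_o = p01 / (p01 + p10)

    def row(i):
        b = w_o * (1.0 - d ** (i + 1))   # p01^{(i+1)} from (21)
        r = [1.0 - b] + [b * p11 ** (j - 2) * p10 for j in range(2, size + 1)]
        r[-1] += 1.0 - sum(r)            # fold the truncated tail into the last state
        return r

    rows = [row(i) for i in range(1, size + 1)]
    lam = [1.0 / size] * size
    for _ in range(sweeps):              # power iteration: lam <- lam R
        lam = [sum(lam[i] * rows[i][j] for i in range(size)) for j in range(size)]
    return lam
```

The first entries of the returned list can then be compared with $1 - \bar{\omega}$ and $\bar{\omega}\, p_{11}^{k-2} p_{10}$ from (33). The truncation is harmless when $p_{11}^{\text{size}}$ is negligible, which is the case for the default size unless $p_{11}$ is very close to 1.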



Appendix E: Proof of Theorem 4

Case 1: $p_{11} \ge p_{01}$. Let $\omega_k$ denote the belief value of the chosen channel in the first slot of the k-th TP. The length $L_k(\omega_k)$ of this TP has the following distribution.

$$\Pr[L_k(\omega_k) = l] = \begin{cases} 1 - \omega_k, & l = 1 \\ \omega_k\, p_{11}^{\,l-2}\, p_{10}, & l > 1 \end{cases} \tag{34}$$

It is easy to see that if $\omega' \ge \omega$, then $L_k(\omega')$ stochastically dominates $L_k(\omega)$.

From the round-robin structure of the myopic policy, $\omega_k = p_{01}^{(J_k)}$, where $J_k = \sum_{i=1}^{N-1} L_{k-i} + 1$. Based on the monotonic increasing property of the j-step transition probability $p_{01}^{(j)}$ (see (21) and Fig. 5), we have $\omega_k \le \omega_o$, where $\omega_o$ is the stationary distribution of the Gilbert-Elliot channel given in (3). $L_k(\omega_o)$ thus stochastically dominates $L_k(\omega_k)$, and the expectation of the former, $\bar{L}_k(\omega_o) = 1 + \frac{\omega_o}{1 - p_{11}}$, leads to the upper bound of U given in (17).
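The two facts used in this step, the mean of the distribution (34) and its stochastic ordering in the initial belief, are easy to check numerically. The Python sketch below is a minimal illustration; the function names and the truncation of the series at a finite number of terms are assumptions.

```python
def tp_length_pmf(w, p11, l):
    """Pr[L = l] from (34) for initial belief w (case p11 >= p01)."""
    return 1.0 - w if l == 1 else w * p11 ** (l - 2) * (1.0 - p11)

def mean_tp_length(w, p11, terms=2000):
    """Expected TP length by summing the series; the closed form is 1 + w / (1 - p11)."""
    return sum(l * tp_length_pmf(w, p11, l) for l in range(1, terms + 1))

def tail(w, p11, l):
    """Pr[L >= l]; a larger w gives a larger tail for every l (stochastic dominance)."""
    return 1.0 if l <= 1 else w * p11 ** (l - 2)
```

Since `tail` is non-decreasing in `w` for every `l`, a larger initial belief yields a stochastically longer TP, and substituting $\omega_o$ for $\omega_k$ gives the mean length $1 + \omega_o/(1 - p_{11})$ used for the upper bound.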

Next, we prove the lower bound of U by constructing a hypothetical system where the initial belief value of the chosen channel in a TP is a lower bound of that in the real system. The average TP length in this hypothetical system is thus smaller than that in the real system, leading to a lower bound on U based on (11). Specifically, since $\omega_k = p_{01}^{(J_k)}$ and $J_k = \sum_{i=1}^{N-1} L_{k-i} + 1 \ge N + L_{k-1} - 1$, we have $\omega_k \ge p_{01}^{(N + L_{k-1} - 1)}$. We thus construct a hypothetical system given by a first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ with the following transition probability $r'_{i,j}$.

$$r'_{i,j} = \begin{cases} 1 - p_{01}^{(N+i-1)}, & i \ge 1,\ j = 1 \\ p_{01}^{(N+i-1)}\, p_{11}^{\,j-2}\, p_{10}, & i \ge 1,\ j \ge 2 \end{cases} \tag{35}$$

It can be shown that the stationary distribution of $\{L_k\}_{k=1}^{\infty}$ stochastically dominates that of the hypothetical system $\{L'_k\}_{k=1}^{\infty}$ (see [6] for details). The latter can be obtained with the same techniques used in Appendix D. The average length of $L'_k$ can thus be calculated, leading to the lower bound given in (17).

Case 2: $p_{11} < p_{01}$. In this case, the larger the initial belief of the chosen channel in a given TP, the smaller the average length of the TP. On the other hand, (11) shows that U is inversely proportional to the average TP length. Thus, similar to the case of $p_{11} \ge p_{01}$, we will construct hypothetical systems where the initial belief of the chosen channel in a TP is an upper bound or a lower bound of that in the real system. The former leads to an upper bound on U, the latter, a lower bound on U.

Consider first the upper bound. From the structure of the myopic policy, it is clear that when $L_{k-1}$ is odd, in the k-th TP, the user will switch to the channel visited in the $(k-2)$-th TP. As a consequence, the initial belief $\omega_k$ of the k-th TP is given by $\omega_k = p_{11}^{(L_{k-1}+1)}$. When $L_{k-1}$ is even, we can show that $\omega_k \le p_{11}^{(L_{k-1}+4)}$. This is because for $N \ge 3$ and $L_{k-1}$ even, the user cannot switch to a channel visited $L_{k-1} + 2$ slots ago, and $p_{11}^{(j)}$ decreases with j for even j's and $p_{11}^{(j)} > p_{11}^{(i)}$ for any even j and odd i (see (22) and Fig. 6). We thus construct a hypothetical system given by the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ with the following transition probabilities.

$$r'_{i,j} = \begin{cases} p_{11}^{(i+1)}, & \text{if } i \text{ is odd},\ j = 1 \\ p_{10}^{(i+1)}\, p_{00}^{\,j-2}\, p_{01}, & \text{if } i \text{ is odd},\ j \ge 2 \\ p_{11}^{(i+4)}, & \text{if } i \text{ is even},\ j = 1 \\ p_{10}^{(i+4)}\, p_{00}^{\,j-2}\, p_{01}, & \text{if } i \text{ is even},\ j \ge 2 \end{cases}$$

It can be shown that the stationary distribution of $\{L'_k\}_{k=1}^{\infty}$ is stochastically dominated by that of $\{L_k\}_{k=1}^{\infty}$. The former leads to the upper bound of U given in (18).

We now consider the lower bound. Similarly, $\omega_k = p_{11}^{(L_{k-1}+1)}$ when $L_{k-1}$ is odd. When $L_{k-1}$ is even, to find a lower bound on $\omega_k$, we need to find the smallest odd j such that the last visit to the channel chosen in the k-th TP is j slots ago. From the structure of the myopic policy, the smallest feasible odd j is $L_{k-1} + 2N - 3$, which corresponds to the scenario where all N channels are visited in turn from the $(k-N+1)$-th TP to the k-th TP with $L_{k-N+1} = L_{k-N+2} = \cdots = L_{k-2} = 2$. We thus have $\omega_k \ge p_{11}^{(L_{k-1}+2N-3)}$. We then construct a hypothetical system given by the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ with the following transition probabilities.

$$r'_{i,j} = \begin{cases} p_{11}^{(i+1)}, & \text{if } i \text{ is odd},\ j = 1 \\ p_{10}^{(i+1)}\, p_{00}^{\,j-2}\, p_{01}, & \text{if } i \text{ is odd},\ j \ge 2 \\ p_{11}^{(i+2N-3)}, & \text{if } i \text{ is even},\ j = 1 \\ p_{10}^{(i+2N-3)}\, p_{00}^{\,j-2}\, p_{01}, & \text{if } i \text{ is even},\ j \ge 2 \end{cases}$$

The stationary distribution of this hypothetical system leads to the lower bound of U given in (18).

Appendix F: Proof of Corollary 2

Let $x = |p_{11} - p_{01}|$. For $p_{11} \ge p_{01}$, after some simplifications, the lower bound has the form $a + b/(x^N + c)$, where $a, b, c$ ($c \ne 0$) are constants. The upper bound is $a + b/c$. We have $|a + b/(x^N + c) - a - b/c| \sim \frac{|b|}{c^2}\, x^N$ as $N \to \infty$. Thus the lower bound converges to the upper bound with geometric rate x.

For $p_{11} < p_{01}$, the lower bound has the form $d + e/(x^{2N-1} + f)$, where $d, e, f$ ($f \ne 0$) are constants. It converges to $d + e/f$ as $N \to \infty$. We have $|d + e/(x^{2N-1} + f) - d - e/f| \sim \frac{|e|}{x f^2}\, x^{2N}$ as $N \to \infty$. Thus the lower bound converges with geometric rate $x^2$.
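The asymptotic behavior used in this proof can be illustrated numerically for the generic form $a + b/(x^N + c)$. In the Python sketch below, the constants are hypothetical values chosen only for illustration and are not derived from any particular set of transition probabilities.

```python
def gap(a, b, c, x, n):
    """|a + b/(x**n + c) - (a + b/c)|: distance of the lower bound from its limit."""
    return abs((a + b / (x**n + c)) - (a + b / c))

# Hypothetical constants, for illustration only.
a, b, c, x = 0.1, 0.4, 0.7, 0.6

# Ratio of successive gaps; it approaches x, i.e. the geometric rate.
ratios = [gap(a, b, c, x, n + 1) / gap(a, b, c, x, n) for n in range(5, 30)]

# Gap normalized by x**n; it approaches |b| / c**2.
normalized = gap(a, b, c, x, 29) / x**29
```

The same computation with $x^{2N-1}$ in place of $x^N$ gives a ratio of successive gaps that approaches $x^2$, which is the rate stated for $p_{11} < p_{01}$.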